We show how to automatically construct a system that satisfies a given logical specification and has 
an optimal average behavior with respect to a specification with ratio costs. 

When synthesizing a system from a logical specification, it is often the case that several different 
systems satisfy the specification. In this case, it is usually not easy for the user to state formally 
which system she prefers. Prior work proposed to rank the correct systems by adding a quantitative 
aspect to the specification. A desired preference relation can be expressed with (i) a quantitative 
language, which is a function assigning a value to every possible behavior of a system, and (ii) 
an environment model defining the desired optimization criteria of the system, e.g., worst-case or 
average-case optimal. 

In this paper, we show how to synthesize a system that is optimal for (i) a quantitative language 
given by an automaton with a ratio cost function, and (ii) an environment model given by a labeled 
Markov decision process. The objective of the system is to minimize the expected (ratio) costs. The 
solution is based on a reduction to Markov Decision Processes with ratio cost functions which do not 
require that the costs in the denominator are strictly positive. We find an optimal strategy for these 
using a fractional linear program. 

1 Introduction 

Quantitative analysis techniques are usually used to measure quantitative properties of systems, such 
as timing, performance, or reliability (cf. |2] HI] IH). We use quantitative reasoning in the classically
Boolean contexts of verification and synthesis because it allows us to distinguish systems with respect
to "soft constraints" like robustness [11] or default behavior [10]. This is particularly helpful in synthesis,
where a system is automatically derived from a specification, because quantitative specifications allow 
us to guide the synthesis tool towards a desired implementation. 

In this paper we show how quantitative specifications based on ratio objectives can be used to guide 
the synthesis process. In particular, we present a technique to synthesize a system with an average- 
case behavior that satisfies a logical specification and optimizes a quantitative objective given by a ratio 
objective. 

The synthesis problem can be seen as a game between two players: the system and the environment 
(the context in which the system operates). The system has a fixed set of interface variables with a 
finite domain to interact with its environment. The variables are partitioned into a set of input and 
output variables. The environment can modify the set of input variables. For instance, an input variable 
can indicate the arrival of some packet on a router on a given port or the request of a client to use a 
shared resource. Each assignment to the input variables is a possible move of the environment in the 
synthesis game. The system reacts to the behavior of the environment by changing the value of the 
output variables. An assignment to the output variables is called an action of the system and describes 
a possible move of the system in the synthesis game. E.g., the system can grant a shared resource to 
Client C by setting a corresponding output variable. Environment and system change their variables in 
turns. In every step, the system first modifies the output variables, then the environment 
changes the input variables. The sequence of variable evaluations built up by this interplay is evaluated 
with respect to a specification. A logical (or qualitative) specification maps every sequence to 1 or 0, 
indicating whether the sequence satisfies the specification or not. For example, a sequence of evaluations 
in which the system grants a shared resource to two clients at the same time is mapped to 0 if the 
specification requires mutually exclusive access to this resource. The aim of the system in the synthesis 
game is to satisfy the specification independent of the choices of the environment. There might be several 
systems that can achieve this goal for a given specification. Therefore, Bloem et al. [10] proposed to add 
a quantitative specification in order to rank the correct systems. A quantitative specification maps every 
infinite sequence of variable evaluations to a value indicating how desirable this behavior is. In this 
paper, we study quantitative specifications resulting from ratio objectives. The idea is that a behavior of 
the system is mapped to two infinite sequences of values. The first sequence refers to events that were 
"good" for the system, while the second sequence refers to "bad" events within a behavior. For instance, 
consider a server processing requests from several clients. If the server receives a request, it can be seen 
as a bad event, since it requires the server to process the request. On the other hand, every handled 
request is clearly a good event. Intuitively, the ratio objective computes the long-run ratio between the 
sum of bad and the sum of good events. This ratio is the value of a behavior. A system can be seen 
as a set of behaviors. We can assign a value to a system by taking, e.g., the worst or the average value 
over all its behaviors. Given a way to evaluate a system, we can ask for a system that optimizes this 
value, i.e., a system that achieves a value at least as good as that of any other system. Taking the worst value over the 
possible behaviors corresponds to assuming that the system is in an adversarial environment. The average 
value is computed with respect to a probabilistic model of the environment [15]. In the average-case 
synthesis game, the environment player is replaced by a probabilistic player that is playing according to 
the probabilistic environment model. 

In this paper, we present the first average-case synthesis algorithm for specifications that evaluate a 
behavior of the system with respect to the ratio of two cost functions [10]. This ratio objective allows us, 
e.g., to ask for a system that optimizes the ratio between requests and acknowledgments in a server-client 
system. For the average-case analysis, we present a new environment model, which is based on Markov 
decision processes and generalizes the one in [15]. We solve the average-case synthesis problem with 
ratio objective by reduction to Markov decision processes with ratio cost functions. For unichain Markov 
Decision Processes with ratio cost functions, we present a solution based on linear programming. 

Related Work. Researchers have considered a number of formalisms for quantitative specifications ISl 
[111 [131 [in 111 [3l Uni |22l 1211 but most of them (except for [11]) do not consider long-run ratio objectives. 
In [11], the environment is assumed to be adversarial, while we assume a probabilistic environment model. 
Regarding the environment model, there have been several notions of metrics for probabilistic systems 
and games proposed in the literature [4, 19]. The metrics measure the distance of two systems with 
respect to all temporal properties expressible in a logic, whereas we (like [15]) use the quantitative 
specification to compare systems wrt the property of interest. In contrast to [15], we use ratio objectives 
and a more general environment model. Our environment model is the same as the one used for 
control and synthesis in the presence of uncertainty (cf. [6, 9]). However, in this context usually 
only qualitative specifications are considered. MDPs with long-run average objectives are well studied. 



C. von Essen & B. Jobstmann 



The books [23, 30] present a detailed analysis of this topic. Cyrus Derman [18] studied MDPs with a 
fractional objective. This work differs in two aspects from ours: first, Derman requires that the payoff of 
the cost function of the denominator is always strictly positive and second, the objective function used 
in [18] is already given in terms of the expected cost of the first cost function to the expected cost of the 
second cost function and not in terms of a single trace. De Alfaro [1] studies a model that is similar to 
ours but does not consider the synthesis problem. Finally, we would like to note that the two choices we 
have in a quantitative synthesis problem, namely the choice of the quantitative language and the choice 
of environment model are the same two choices that appear in weighted automata and max-plus algebras 
(cf. lEHIMIIIl). 

2 Preliminaries 

Words, Qualitative and Quantitative Languages. Given a finite alphabet $\Sigma$, a word $w = w_0 w_1 \dots$ is 
a finite or infinite sequence of elements of $\Sigma$. We use $w_i$ to denote the $(i+1)$-th element in the sequence. 
If $w$ is finite, then $|w|$ denotes the length of $w$, otherwise $|w|$ is infinity. We denote the empty word by $\epsilon$, 
i.e., $|\epsilon| = 0$. We use $\Sigma^*$ and $\Sigma^\omega$ to denote the set of finite and infinite words, respectively. Given a finite 
word $w \in \Sigma^*$ and a finite or infinite word $v \in \Sigma^* \cup \Sigma^\omega$, we write $wv$ for the concatenation of $w$ and $v$. A 
qualitative language $\varphi$ is a function $\varphi : \Sigma^\omega \to \mathbb{B}$ mapping every infinite word to 1 or 0. Intuitively, a 
qualitative language partitions the set of words into a set of good and a set of bad traces. A quantitative 
language [14] $\psi$ is a function $\psi : \Sigma^\omega \to \mathbb{R}^+ \cup \{\infty\}$ associating to each infinite word a value from the 
extended non-negative reals. 

Specifications and automata with cost functions. An automaton is a tuple $\mathcal{A} = (\Sigma, Q, q_0, \delta, F)$, 
where $\Sigma$ is a finite alphabet, $Q$ is a finite set of states, $q_0 \in Q$ is an initial state, $\delta : Q \times \Sigma \to Q$ is 
the transition function, and $F \subseteq Q$ is a set of safe states. We use $\delta^* : Q \times \Sigma^* \to Q$ to denote the closure of 
$\delta$ over finite words. Formally, given a word $w = w_0 \dots w_n \in \Sigma^*$, $\delta^*$ is defined inductively as $\delta^*(q, \epsilon) = q$, 
and $\delta^*(q, w) = \delta(\delta^*(q, w_0 \dots w_{n-1}), w_n)$. We use $|\mathcal{A}|$ to denote the size of the automaton. 

The run $\rho$ of $\mathcal{A}$ on an infinite word $w = w_0 w_1 w_2 \dots \in \Sigma^\omega$ is an infinite sequence of states $q_0 q_1 q_2 \dots$ 
such that $q_0$ is the initial state of $\mathcal{A}$ and $\forall i \geq 0 : \delta(q_i, w_i) = q_{i+1}$ holds. The run $\rho$ is called accepting if 
for all $i \geq 0$, $q_i \in F$. A word $w$ is accepting if the corresponding run is accepting. The language of $\mathcal{A}$, 
denoted by $\mathcal{L}_{\mathcal{A}}$, is the qualitative language $\mathcal{L}_{\mathcal{A}} : \Sigma^\omega \to \mathbb{B}$ mapping all accepting words to 1 and 
non-accepting words to 0, i.e., $\mathcal{L}_{\mathcal{A}}$ is the characteristic function of the set of all accepting words of $\mathcal{A}$. We 
assume without loss of generality that $Q \setminus F$ is closed under $\delta$, i.e., $\forall s \in Q \setminus F, \forall a \in \Sigma : \delta(s, a) \in Q \setminus F$. 
Note that every automaton can be modified to meet this assumption by (i) adding a new state $q_\bot$ with a 
self-loop for every letter and (ii) redirecting every transition starting from $Q \setminus F$ to the new state $q_\bot$. The 
modified automaton accepts the same language as the original automaton. 
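
The closure of the transition function over finite words and the redirection of non-safe states to a fresh sink can be sketched in a few lines. This is a minimal sketch, assuming the automaton is stored as a Python dictionary from (state, letter) pairs to states; the function names, the sink name `q_bot`, and the toy mutual-exclusion automaton are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

State = Hashable
Letter = Hashable
Delta = Dict[Tuple[State, Letter], State]


def delta_star(delta: Delta, q: State, word: Iterable[Letter]) -> State:
    """Closure of delta over finite words, applied letter by letter."""
    for letter in word:
        q = delta[(q, letter)]
    return q


def make_unsafe_absorbing(delta: Delta, alphabet: Set[Letter],
                          safe: Set[State], sink: State = "q_bot") -> Delta:
    """Redirect every transition leaving a non-safe state to a fresh sink."""
    states = {q for (q, _) in delta} | set(delta.values())
    if sink in states:
        raise ValueError("sink name must be fresh")
    new_delta = {(q, a): (target if q in safe else sink)
                 for (q, a), target in delta.items()}
    for a in alphabet:
        new_delta[(sink, a)] = sink  # self-loop for every letter
    return new_delta


def prefix_is_safe(delta: Delta, q0: State, safe: Set[State],
                   word: Iterable[Letter]) -> bool:
    """True iff every state visited while reading the finite word is safe."""
    q = q0
    if q not in safe:
        return False
    for letter in word:
        q = delta[(q, letter)]
        if q not in safe:
            return False
    return True


# Hypothetical toy automaton: letters are pairs (a1, a2); (1, 1) is forbidden.
alphabet = {(0, 0), (0, 1), (1, 0), (1, 1)}
delta = {("ok", a): ("bad" if a == (1, 1) else "ok") for a in alphabet}
delta.update({("bad", a): "ok" for a in alphabet})  # deliberately not closed
closed = make_unsafe_absorbing(delta, alphabet, safe={"ok"})
```

Only finite prefixes can be checked this way; acceptance of an infinite word requires every prefix to stay within the safe states.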

Given an automaton $\mathcal{A} = (\Sigma, Q, q_0, \delta, F)$, a cost function $c : Q \times \Sigma \to \mathbb{N}$ is a function that maps every 
transition in $\mathcal{A}$ to a non-negative integer. We use automata with cost functions and objective functions to 
define quantitative languages (or properties). Intuitively, the objective function tells us how to summarize 
the costs along a run. Given an automaton $\mathcal{A}$ and two cost functions $c_1, c_2$, the ratio objective [11] 
computes the ratio between the costs seen along a run of $\mathcal{A}$ on a word $w = w_0 w_1 w_2 \dots \in \Sigma^\omega$: 

\[
\mathcal{R}(w) := \lim_{m \to \infty} \liminf_{l \to \infty} \frac{\sum_{i=m}^{l} c_1(\delta^*(q_0, w_0 \dots w_i), w_{i+1})}{1 + \sum_{i=m}^{l} c_2(\delta^*(q_0, w_0 \dots w_i), w_{i+1})} \tag{1}
\]

The ratio objective is a generalization of the long-run average objective (also known as mean-payoff 
objective, cf. fS3\ ). We use $\mathcal{R}^{c_1/c_2}_{\mathcal{A}}$ to denote the quantitative language defined by $\mathcal{A}$, $c_1$, $c_2$, and the ratio 
objective function. If $\mathcal{A}$, $c_1$, or $c_2$ are clear from the context, we drop them. 

Intuitively, $\mathcal{R}$ computes the long-run ratio between the costs accumulated along a run. The first limit 
allows us to ignore a finite prefix of the run, which ensures that we only consider the long-run behavior. 
The 1 in the denominator avoids division by 0 if the accumulated costs are 0 and has no effect if the 
accumulated costs are infinite. We need the limit inferior here because the sequence of the limit might not 
converge. Consider the sequence p = q^r^q^r^q^^ where q'' means that the State q is visited ^-times. 
Assume State $q$ and State $r$ have the following costs: $c_1(q) = 0$, $c_2(q) = 1$, $c_1(r) = 1$ and $c_2(r) = 1$. 
Then, the value of $\rho_0 \dots \rho_l$ will alternate between 0 and 1 with increasing $l$ and hence the sequence for 
$l \to \infty$ will not converge. The limit inferior of this sequence is 0. 
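
The oscillation can be observed numerically on finite windows. The following is a minimal sketch, not the paper's construction: it fixes the window start at $m = 0$, evaluates the quotient from the ratio objective at the end of each block, and uses factorial block lengths, which are an assumption made only for illustration. A finite computation can only hint at the limit behavior; faster-growing blocks push the two subsequences closer to 0 and 1.

```python
from typing import List, Sequence


def window_ratio(c1: Sequence[int], c2: Sequence[int], m: int, l: int) -> float:
    """Quotient sum_{i=m..l} c1[i] / (1 + sum_{i=m..l} c2[i])."""
    numerator = sum(c1[m:l + 1])
    denominator = 1 + sum(c2[m:l + 1])
    return numerator / denominator


def block_end_ratios(block_lengths: Sequence[int]) -> List[float]:
    """Alternate q-blocks (costs 0, 1) and r-blocks (costs 1, 1).

    Returns the window ratio (with m = 0) at the end of every block.
    """
    c1: List[int] = []
    c2: List[int] = []
    ratios: List[float] = []
    for j, length in enumerate(block_lengths):
        in_q = (j % 2 == 0)  # even blocks visit q, odd blocks visit r
        c1.extend([0 if in_q else 1] * length)
        c2.extend([1] * length)
        ratios.append(window_ratio(c1, c2, 0, len(c1) - 1))
    return ratios


# Assumed block lengths 1!, 2!, ..., 6! (illustrative only).
ratios = block_end_ratios([1, 2, 6, 24, 120, 720])
```

The values at the ends of q-blocks stay small while those at the ends of r-blocks stay large, so the sequence of quotients has no limit and the limit inferior is needed.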

Finite-state system and Correctness. A finite-state system $\mathcal{S} = (S, L, s_0, A, \delta, \tau)$ consists of the 
automaton $\mathcal{A} = (L, S, s_0, \delta, S)$¹, an output (or action) alphabet $A$, and an output function $\tau : S \to A$ 
assigning to each state of the system a letter from the output alphabet. The alphabet $L$ of the automaton is 
called the input alphabet of the system. Given an input word $w$, the run of the system $\mathcal{S}$ on the word $w$ 
is simply the run of $\mathcal{A}$ on the word $w$. For every word $w$ over the input alphabet, the system produces 
a word over the joint input/output alphabet. We use $O_{\mathcal{S}}$ to denote the function mapping input words 
to the joint input/output word, i.e., given an input word $w = w_0 w_1 \dots \in L^\omega$, $O_{\mathcal{S}}(w)$ is the sequence of 
tuples $(l_0, a_0)(l_1, a_1) \dots \in (L \times A)^\omega$ such that (i) $l_i = w_i$ for all $i \geq 0$, (ii) $a_0 = \tau(s_0)$, and (iii) for all $i > 0$, 
$a_i = \tau(\delta^*(s_0, w_0 \dots w_{i-1}))$ holds. 
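
The joint input/output word can be computed prefix by prefix. The sketch below assumes the system is given by two dictionaries (transition function and output function); the function name `joint_word`, the state names, and the output labels `grant1` and `grant2` are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

State = Hashable
Input = Hashable
Output = Hashable


def joint_word(delta: Dict[Tuple[State, Input], State],
               tau: Dict[State, Output],
               s0: State,
               inputs: Sequence[Input]) -> List[Tuple[Input, Output]]:
    """Finite prefix (l_0, a_0)(l_1, a_1)... of the joint input/output word.

    The output a_i depends only on the inputs read before position i,
    so it is emitted before the system moves on the i-th input.
    """
    state = s0
    word = []
    for letter in inputs:
        word.append((letter, tau[state]))
        state = delta[(state, letter)]
    return word


# Hypothetical two-state system that alternates between two outputs.
toggle_delta = {("m0", 0): "m1", ("m0", 1): "m1",
                ("m1", 0): "m0", ("m1", 1): "m0"}
toggle_tau = {"m0": "grant1", "m1": "grant2"}
prefix = joint_word(toggle_delta, toggle_tau, "m0", [1, 0, 1])
```

Because the output is read off the current state before the input is consumed, the first output is always the output of the initial state, matching condition (ii).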

Given a system $\mathcal{S}$ with input alphabet $L$ and output alphabet $A$, and an automaton $\mathcal{A}$ with alphabet 
$\Sigma = L \times A$, we say that the system $\mathcal{S}$ satisfies the specification $\mathcal{A}$, denoted $\mathcal{S} \models \mathcal{A}$, if for all input 
words, the joint input/output word produced by the system $\mathcal{S}$ is accepted by the automaton $\mathcal{A}$, i.e., 
$\forall w \in L^\omega : (\mathcal{L}_{\mathcal{A}} \circ O_{\mathcal{S}})(w) = 1$, where $\circ$ denotes the function composition operator. 

Probability space. We use the standard definitions of probability spaces. A probability space is given 
by a tuple $\mathcal{P} := (\Omega, \mathcal{F}, \mu)$, where $\Omega$ is the set of outcomes or samples, $\mathcal{F} \subseteq 2^\Omega$ is the $\sigma$-algebra defining 
the set of measurable events, and $\mu : \mathcal{F} \to [0, 1]$ is a probability measure assigning a probability to 
each event such that $\mu(\Omega) = 1$ and for each countable set $E_1, E_2, \dots \in \mathcal{F}$ of disjoint events we have 
$\mu(\bigcup_i E_i) = \sum_i \mu(E_i)$. Recall that, since $\mathcal{F}$ is a $\sigma$-algebra, it satisfies the following three conditions: (i) 
$\emptyset \in \mathcal{F}$, (ii) $E \in \mathcal{F}$ implies $\Omega \setminus E \in \mathcal{F}$ for any event $E$, and (iii) the union of any countable set of events 
$E_1, E_2, \dots \in \mathcal{F}$ is also in $\mathcal{F}$, i.e., $\bigcup_i E_i \in \mathcal{F}$. Given a measurable function $f : \Omega \to \mathbb{R} \cup \{+\infty, -\infty\}$, we 
use $\mathbb{E}_{\mathcal{P}}[f]$ to denote the expected value of $f$ under $\mu$, i.e., 

\[
\mathbb{E}_{\mathcal{P}}[f] = \int_{\Omega} f \, d\mu. \tag{2}
\]

If $\mathcal{P}$ is clear from the context we drop the subscript or replace it with the structure that defines $\mathcal{P}$. The 
integral used here is the Lebesgue integral, which is commonly used to define the expected value of a 
random variable. Note that the expected value is always defined if the function $f$ maps only to values in 
$\mathbb{R}^+ \cup \{\infty\}$. 

Markov chains and Markov decision processes (MDP). Let $\mathcal{D}(S) := \{p : S \to [0, 1] \mid \sum_{s \in S} p(s) = 1\}$ 
be the set of probability distributions over a set $S$. 

¹Note that the last element of this tuple is the set of safe states, i.e., every state is safe. 




A Markov decision process is a tuple $\mathcal{M} = (S, s_0, A, \tilde{A}, p)$, where $S$ is a finite set of states, $s_0 \in S$ is 
an initial state, $A$ is the finite set of actions, $\tilde{A} : S \to 2^A$ is the enabled action function defining for each 
state $s$ the set of enabled actions in $s$, and $p : S \times A \to \mathcal{D}(S)$ is the probabilistic transition function. For 
technical convenience we assume that every state has at least one enabled action, i.e., $\forall s \in S : |\tilde{A}(s)| \geq 1$. 
If $|\tilde{A}(s)| = 1$ for all states $s \in S$, then $\mathcal{M}$ is called a Markov chain (MC). In this case, we omit $A$ and $\tilde{A}$ 
from the definition of $\mathcal{M}$. Given a Markov chain $\mathcal{M}$, we say that $\mathcal{M}$ is irreducible if every state can be 
reached from any other. We say that it is unichain if it has at most one maximal set of states that can 
reach each other. We call an MDP unichain if every strategy induces a unichain MC. 

An $L$-labeled Markov decision process is a tuple $\mathcal{M} = (S, s_0, A, \tilde{A}, p, \lambda)$, where $(S, s_0, A, \tilde{A}, p)$ is a 
Markov decision process and $\lambda : S \to L$ is a labeling function such that $\mathcal{M}$ is deterministic with respect 
to $\lambda$, i.e., for all states $s, s', s''$ and every action $a$ such that $s' \neq s''$, $p(s, a)(s') > 0$ and $p(s, a)(s'') > 0$ we 
have $\lambda(s') \neq \lambda(s'')$. Since we use $L$-labeled Markov decision processes to represent the behavior of the 
environment, we require that in every state all actions are enabled, i.e., $\forall s \in S : \tilde{A}(s) = A$. 
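
The determinism requirement on the labeling is easy to check mechanically. The sketch below assumes the transition function is stored as a dictionary from (state, action) pairs to successor distributions; the function name and the toy data (states `n`, `r`, `r2` and action `ack`) are hypothetical.

```python
from typing import Dict, Hashable, Tuple

State = Hashable
Action = Hashable
Label = Hashable


def is_label_deterministic(p: Dict[Tuple[State, Action], Dict[State, float]],
                           label: Dict[State, Label]) -> bool:
    """Check that no two distinct successors with positive probability
    under the same (state, action) pair carry the same label."""
    for dist in p.values():
        successor_by_label: Dict[Label, State] = {}
        for succ, prob in dist.items():
            if prob <= 0:
                continue
            letter = label[succ]
            if letter in successor_by_label and successor_by_label[letter] != succ:
                return False
            successor_by_label[letter] = succ
    return True


# Hypothetical examples: label 1 means "request", label 0 means "no request".
good_label = {"n": 0, "r": 1}
good_p = {("n", "ack"): {"n": 0.5, "r": 0.5},
          ("r", "ack"): {"n": 0.5, "r": 0.5}}
bad_label = {"n": 0, "r": 1, "r2": 1}
bad_p = {("n", "ack"): {"r": 0.5, "r2": 0.5}}
```

With a deterministic labeling, an observed input letter identifies the successor state uniquely, which is what allows the environment state to be tracked in a product construction.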

Sample runs and strategies. A (sample) run $\rho$ of $\mathcal{M}$ is an infinite sequence of tuples $(s_0, a_0)(s_1, a_1) \dots \in 
(S \times A)^\omega$ of states and actions such that for all $i \geq 0$, (i) $a_i \in \tilde{A}(s_i)$ and (ii) $p(s_i, a_i)(s_{i+1}) > 0$. We use $\Omega$ 
to denote the set of all runs, and $\Omega_s$ for the set of runs starting at state $s$. A finite run of $\mathcal{M}$ is a prefix 
of some infinite run. To avoid confusion, we use $v$ to refer to a finite run. Given a finite run $v$, the set 
$\gamma(v) := \{\rho \in \Omega \mid \exists \rho' \in \Omega : \rho = v\rho'\}$ of all possible infinite extensions of $v$ is called the cone set of $v$. We 
use the usual extension of $\gamma(\cdot)$ to sets of finite words. 

A strategy is a function $\pi : (S \times A)^* S \to \mathcal{D}(A)$ that assigns a probability distribution to all finite 
sequences in $(S \times A)^* S$. A strategy must refer only to enabled actions, i.e., for all sequences $w \in (S \times A)^*$, 
states $s \in S$, and actions $a \in A$, if $\pi(ws)(a) > 0$, then action $a$ has to be enabled in $s$, i.e., $a \in \tilde{A}(s)$. A 
strategy $\pi$ is pure if for all finite sequences $w \in (S \times A)^*$ and for all states $s \in S$, there is an action $a \in A$ 
such that $\pi(ws)(a) = 1$. A memoryless strategy is independent of the history of the run, i.e., for all 
$w, w' \in (S \times A)^*$ and for all $s \in S$, $\pi(ws) = \pi(w's)$ holds. A memoryless strategy can be represented as a 
function $\pi : S \to \mathcal{D}(A)$. A pure and memoryless strategy can be represented by a function $\pi : S \to A$ 
mapping states to actions. An MDP $\mathcal{M} = (S, s_0, A, \tilde{A}, p)$ together with a pure and memoryless strategy 
$\pi : S \to A$ defines the Markov chain $\mathcal{M}^\pi = (S, s_0, A, \tilde{A}_\pi, p)$, in which only the actions prescribed in the 
strategy $\pi$ are enabled, i.e., $\tilde{A}_\pi(s) = \{\pi(s)\}$. Note that every finite-state system $\mathcal{S}$ with input alphabet $S$ 
and output alphabet $A$ that refers only to enabled actions can be viewed as a strategy for $\mathcal{M}$. Vice-versa, 
an MDP $\mathcal{M}$ with a pure and memoryless strategy $\pi$ defines a finite state system $\mathcal{S}$ with input alphabet $S$ 
and output alphabet $A$. 
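
Restricting an MDP to a pure and memoryless strategy amounts to keeping one successor distribution per state. The following is a minimal sketch under the assumption that the MDP is stored in dictionaries; `induced_chain` and the toy MDP are illustrative names and data, not part of the paper.

```python
from typing import Dict, Hashable, Set, Tuple

State = Hashable
Action = Hashable
Dist = Dict[State, float]


def induced_chain(p: Dict[Tuple[State, Action], Dist],
                  enabled: Dict[State, Set[Action]],
                  strategy: Dict[State, Action]) -> Dict[State, Dist]:
    """Markov chain obtained by enabling only the action the strategy picks."""
    if set(strategy) != set(enabled):
        raise ValueError("strategy must choose an action in every state")
    chain = {}
    for state, action in strategy.items():
        if action not in enabled[state]:
            raise ValueError(f"action {action!r} is not enabled in {state!r}")
        chain[state] = dict(p[(state, action)])
    return chain


# Hypothetical two-state MDP with two actions in state "s".
toy_p = {("s", "a"): {"s": 0.5, "t": 0.5},
         ("s", "b"): {"t": 1.0},
         ("t", "a"): {"s": 1.0}}
toy_enabled = {"s": {"a", "b"}, "t": {"a"}}
toy_chain = induced_chain(toy_p, toy_enabled, {"s": "b", "t": "a"})
```

The check against `enabled` mirrors the requirement that a strategy refers only to enabled actions.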

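Induced probability space, objective function, and optimal strategies. An MDP $\mathcal{M} = (S, s_0, A, \tilde{A}, p)$ 
together with a strategy $\pi$ and a state $s$ induces a probability space $\mathcal{P}^{\pi}_{\mathcal{M},s} = (\Omega^{\pi}_{\mathcal{M},s}, \mathcal{F}^{\pi}_{\mathcal{M},s}, \mu^{\pi}_{\mathcal{M},s})$ over 
the cone sets of the runs starting in $s$. Hence, $\Omega^{\pi}_{\mathcal{M},s} = S^\omega$. The probability measure of a cone set is the 
probability that the MDP starts from state $s$ and follows the common prefix under the strategy $\pi$. By 
convention, $\mathcal{P}^{\pi}_{\mathcal{M}} := \mathcal{P}^{\pi}_{\mathcal{M},s_0}$. If $\mathcal{M}$ is a Markov chain, then $\pi$ is fixed (since there is only one available 
action in every state), and we simply write $\mathcal{P}_{\mathcal{M}}$. 

An objective function of $\mathcal{M}$ is a measurable function $f : (S \times A)^\omega \to \mathbb{R}^+ \cup \{\infty\}$ that maps runs of $\mathcal{M}$ 
to values in $\mathbb{R}^+ \cup \{\infty\}$. We use $\mathbb{E}^{\pi}_{\mathcal{M},s}[f]$ to denote the expected value of $f$ wrt the probability space 
induced by the MDP $\mathcal{M}$, a strategy $\pi$, and a state $s$. 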

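We are interested in a strategy that has the least expected value for a given state. Given an MDP $\mathcal{M}$ 
and a state $s$, a strategy $\pi$ is called optimal for objective $f$ and state $s$ if $\mathbb{E}^{\pi}_{\mathcal{M},s}[f] = \min_{\pi'} \mathbb{E}^{\pi'}_{\mathcal{M},s}[f]$, 
where $\pi'$ ranges over all possible strategies. 

Given an MDP $\mathcal{M} = (S, s_0, A, \tilde{A}, p)$ and two cost functions $c_1 : S \times A \to \mathbb{N}$ and $c_2 : S \times A \to \mathbb{N}$, the 
ratio payoff value is the function $\mathcal{R}_{c_1/c_2} : (S \times A)^\omega \to \mathbb{R}^+ \cup \{\infty\}$ mapping every run $\rho$ to a value in $\mathbb{R}^+ \cup \{\infty\}$ 
as follows: 

\[
\mathcal{R}_{c_1/c_2}(\rho) := \lim_{m \to \infty} \liminf_{l \to \infty} \frac{\sum_{i=m}^{l} c_1(\rho_i)}{1 + \sum_{i=m}^{l} c_2(\rho_i)} \tag{3}
\]

We drop the subscript $c_1/c_2$ if $c_1$ and $c_2$ are clear from the context. 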

3 Synthesis with Ratio Objective in Probabilistic Environments 

In this section, we first present a variant of the quantitative synthesis problem introduced in [10]. Then, 
we show how to solve the synthesis problem with safety and ratio specifications in a probabilistic 
environment described by an MDP. 

The quantitative synthesis problem with probabilistic environments asks to construct a finite-state 
system $\mathcal{S}$ that satisfies a qualitative specification and optimizes a quantitative specification under the 
given environment. The specifications are qualitative and quantitative languages over letters in $(L \times A)$, 
where $L$ and $A$ are the input and output alphabet of $\mathcal{S}$, respectively. 

In order to compute the average behavior of a system, we assume a model of the environment. In [15], 
the environment model is a probability space $\mathcal{P} = (L^\omega, \mathcal{F}, \mu)$ over the input words $L^\omega$ of the system 
defined by a finite $L$-labeled Markov chain. This model assumes that the behavior of the environment 
is independent of the behavior of the system, which restricts the modeling possibilities. For instance, a 
client-server system, in which a client increases the probability of sending a request if it has not been 
served in the previous step, cannot be modeled using this approach. Therefore, our environment model 
is a function $f_e$ that maps every system $f_s : L^* \to A$ to a probability space $\mathcal{P} = (L^\omega, \mathcal{F}, \mu)$ over the input 
words $L^\omega$. Note that every finite-state system defines such a system function $f_s$ but not vice versa. To 
describe a particular environment model $f_e$, we use a finite $L$-labeled Markov decision process. Once we 
have an environment model, we can define what it means for a system to satisfy a specification under a 
given environment. 

Definition 1 (Satisfaction). Given a finite-state system $\mathcal{S}$ with alphabets $L$ and $A$, a qualitative specification 
$\varphi$ over alphabet $L \times A$, and an environment model $f_e$, we say that $\mathcal{S}$ satisfies $\varphi$ under $f_e$ (written 
$\mathcal{S} \models_{f_e} \varphi$)² iff $\mathcal{S}$ satisfies $\varphi$ with probability 1, i.e., 

\[
\mathbb{E}_{f_e(\mathcal{S})}[\varphi \circ O_{\mathcal{S}}] = 1.
\]

Recall that $O_{\mathcal{S}}$ denotes the function that maps input words to joint input/output words, and that 
$\varphi$ is a qualitative specification, which maps (input/output) words to 0 or 1. Hence, $\varphi \circ O_{\mathcal{S}}$ denotes the 
function that maps an input sequence to 1 if the behavior of the system for this input word satisfies the 
specification $\varphi$. Otherwise, the input word is mapped to 0. The function $\mathbb{E}_{f_e(\mathcal{S})}[f]$ of some measurable 
function $f$ denotes the expected value of $f$ under the probability distribution induced by the system $\mathcal{S}$ 
under the environment model $f_e$. Hence, Definition 1 says that a system satisfies a specification under a 
probabilistic environment model if almost all behaviors of the system satisfy the specification, i.e., the 
probability that the system misbehaves is 0. 

²Note that $\mathcal{S} \models_{f_e} \varphi$ and $\mathcal{S} \models \varphi$ coincide if (i) $\varphi$ is prefix-closed (which is the case for the specifications we consider 
here), and (ii) $f_e(\mathcal{S})$ assigns, for every finite word $w \in L^*$, a positive probability to the set of infinite words $wL^\omega$. 
Figure 1: Specifications for the client-server example. (a) Automaton stating mutual exclusion. (b) Automaton with cost functions for client $i$. 



Next, we define the value of a system with respect to a specification under an environment model 
and what it means for a system to optimize a specification. Then, we are ready to define the quantitative 
synthesis problem. 

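Definition 2 (Value of a system). Given a finite-state system $\mathcal{S}$ with alphabets $L$ and $A$, a qualitative ($\varphi$) 
and a quantitative specification ($\psi$) over alphabet $L \times A$, and an environment model $f_e$, the value of $\mathcal{S}$ 
with respect to $\varphi$ and $\psi$ under $f_e$ is defined as the expected value of the function $\psi \circ O_{\mathcal{S}}$ in the probability 
space $f_e(\mathcal{S})$, if $\mathcal{S}$ satisfies $\varphi$, and $\infty$ otherwise. Formally, 

\[
\mathit{Value}^{\varphi\psi}_{f_e}(\mathcal{S}) :=
\begin{cases}
\mathbb{E}_{f_e(\mathcal{S})}[\psi \circ O_{\mathcal{S}}] & \text{if } \mathcal{S} \models_{f_e} \varphi, \\
\infty & \text{otherwise.}
\end{cases}
\]

If $\varphi$ is the set of all words, then we write $\mathit{Value}^{\psi}_{f_e}(\mathcal{S})$. Furthermore, we say $\mathcal{S}$ optimizes $\psi$ wrt $f_e$, if 
$\mathit{Value}^{\psi}_{f_e}(\mathcal{S}) \leq \mathit{Value}^{\psi}_{f_e}(\mathcal{S}')$ for all systems $\mathcal{S}'$. 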

Definition 3 (Quantitative realizability and synthesis problem). Given a qualitative specification $\varphi$ and 
a quantitative specification $\psi$ over the alphabets $L \times A$ and an environment model $f_e$, the realizability 
problem asks to decide if there exists a finite-state system $\mathcal{S}$ with alphabets $L$ and $A$ such that 
$\mathit{Value}^{\varphi\psi}_{f_e}(\mathcal{S}) \neq \infty$. The synthesis problem asks to construct a finite-state system $\mathcal{S}$ (if it exists) s.t. 

1. $\mathit{Value}^{\varphi\psi}_{f_e}(\mathcal{S}) \neq \infty$ and 

2. $\mathcal{S}$ optimizes $\psi$ wrt $f_e$. 

In the following, we give an example of a quantitative synthesis problem. 



Server-client example. Consider a server-client system with two clients and one server. Each server-client 
interface consists of two variables $r_i$ (request) and $a_i$ (acknowledge). Client $i$ sends a request by 
setting $r_i$ to 1. The server acknowledges the request by setting $a_i$ to 1. We require that the server does 
not acknowledge both clients at the same time. Hence, our qualitative specification demands mutual 
exclusion. Figure 1(a) shows an automaton stating the mutual exclusion property for $a_1$ and $a_2$. Edges 
are labeled with sets of evaluations of $a_1$ and $a_2$, e.g., $\bar{a}_1$ states that $a_1$ has to be 0 and $a_2$ can have 
either value, 1 or 0. States drawn with a double circle are safe states. Among all systems satisfying the 
mutual exclusion property, we ask for a system that minimizes the average ratio between requests and 
useful acknowledgments. An acknowledge is useful if it is sent as a response to a request. To express this 
property, we can give a quantitative language defined by an automaton with two cost functions (ci,C2) 
and the ratio objective (Eqn. 1). Figure 1(b) shows an automaton labeled with tuples representing the 
two cost functions $c_1$ and $c_2$ for one client. The first component of the tuples represents cost function $c_1$, 
the second component defines cost function $c_2$. The cost function $c_1$ is 1 whenever we see a request. 
The cost function $c_2$ is 1 when we see a "useful" acknowledge, which is an acknowledge that matches 
an unacknowledged request. E.g., every acknowledge in state $s_1$ is useful, since the last request has not 
been acknowledged yet. In the other state, only acknowledgments that answer a direct request are useful and get 
cost 1 (in the second component). This corresponds to a server with a buffer that can hold exactly one 
request, which becomes outdated after two steps and has to be dropped. State $s_1$ says that there is a request 
in the buffer. If there is no acknowledgment while the machine is in this state, then the request is lost. 
This means that a request has to be acknowledged in the step it is received or in the step after that. 

Assume we know the expected behavior of the clients. E.g., in every step, Client 1 is expected to 
send a request with probability 0.5 independent of the acknowledgments. Client 2 changes its behavior 
based on the acknowledgments. We can describe the behavior of Client 2 by the labeled MDP shown 
in Figure 2(a). In the beginning the chance of getting a request from this client is 0.5. Once it has sent 
a request, i.e., it is in state r, the probability of sending a request again is very high until at least one 
acknowledgment is given. This is modeled by action g at state r having a probability of 3/4 to get into 
state r again, and a probability of 1/4 to not send a request in the next step. In this case, we move to the 
right r state. In this state, the probability of receiving a request from this client in the next step is even 
7/8. This means that if this client does not receive an acknowledgment after having sent a request, then 
the probability of receiving another request from this client in the next two steps is $1 - 1/4 \cdot 1/8 = 31/32$. 
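
The two-step probability can be checked with exact rational arithmetic. The snippet below only restates the numbers given in the text (1/4 for no request in the first step, 7/8 for a request in the second step); the variable names are illustrative.

```python
from fractions import Fraction

# Probability that the unacknowledged client sends no request in the next step.
p_no_request_first = Fraction(1, 4)

# After a step without request, a request follows with probability 7/8,
# so the probability of again seeing no request is 1/8.
p_no_request_second = 1 - Fraction(7, 8)

# At least one request within two steps is the complement of seeing none.
p_request_within_two_steps = 1 - p_no_request_first * p_no_request_second
```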

Consider the finite-state system $\mathcal{S}$ shown in Figure 2(b). It is an implementation of a server for two 
clients. The system has two states $m_0$ and $m_1$, each labeled with an evaluation of $a_1$ and $a_2$. We can compute 
the value of $\mathcal{S}$ using the following two lemmas (Lem. 1, Lem. 2). 
Lemma 1. Given (i) a finite-state system $\mathcal{S}$ with alphabets $L$ and $A$, (ii) an automaton $\mathcal{A}$ with alphabet 
$L \times A$, and (iii) an $L$-labeled MDP $\mathcal{M}$ defining an environment model for $\mathcal{S}$, there exists a Markov chain 
and two cost functions $c_1$ and $c_2$ such that 

Proof idea: The Markov chain is constructed by taking the synchronous product of $\mathcal{S}$, $\mathcal{A}$, 
and $\mathcal{M}$. In every state $(s, q, m) \in (S_{\mathcal{S}} \times Q_{\mathcal{A}} \times S_{\mathcal{M}})$, we take the action $a \in A$ given by the labeling 
function of the system $\tau(s)$ and move to a successor state for every input label $l \in L$ such that there exists 
a state $m'$ in the MDP $\mathcal{M}$ with $\lambda(m') = l$ and $p(m, a)(m') > 0$. The corresponding successor states of 
the system- and the automaton-state components are $s' = \delta_{\mathcal{S}}(s, l)$ and $q' = \delta_{\mathcal{A}}(q, (l, a))$. The probability 
distribution of the Markov chain is taken from the $\mathcal{M}$-component. The two cost functions are defined as follows: for 
state $(s, q, m)$ and an action $a$ we set $c_1((s, q, m), a) = 0$ and $c_2((s, q, m), a) = 1$, if $q$ is a safe state in $\mathcal{A}$, 
otherwise $c_1((s, q, m), a) = 1$ and $c_2((s, q, m), a) = 0$. Intuitively, since the non-safe states of $\mathcal{A}$ are (by 
definition) closed under $\delta_{\mathcal{A}}$ and all actions in this set have the same cost, they all have the same value, 
namely $\infty$, so does every state from which there is a positive probability to reach this set.³ 
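
The synchronous product from the proof idea can be sketched as a worklist construction. This is a minimal sketch under several assumptions: all components are stored as Python dictionaries, only the part reachable from the initial product state is built, and costs are keyed by product state because each state of the chain has exactly one action. The function name `product_chain` and the toy instance are hypothetical.

```python
from collections import deque


def product_chain(sys_delta, tau, s0, aut_delta, q0, safe, mdp_p, label, m0):
    """Reachable part of the product of system, automaton, and labeled MDP.

    sys_delta: (s, l) -> s'         system transition on input letter l
    tau:       s -> a               output function of the system
    aut_delta: (q, (l, a)) -> q'    automaton transition on joint letter
    mdp_p:     (m, a) -> {m': pr}   probabilistic transition function
    label:     m -> l               labeling function of the MDP
    Returns (transitions, c1, c2, initial_state).
    """
    init = (s0, q0, m0)
    trans, c1, c2 = {}, {}, {}
    todo = deque([init])
    while todo:
        state = todo.popleft()
        if state in trans:
            continue
        s, q, m = state
        a = tau[s]  # the single action taken in this product state
        # Costs (0, 1) in safe automaton states and (1, 0) otherwise.
        c1[state], c2[state] = (0, 1) if q in safe else (1, 0)
        succs = {}
        for m_next, prob in mdp_p[(m, a)].items():
            if prob <= 0:
                continue
            l = label[m_next]  # input letter observed when moving to m_next
            nxt = (sys_delta[(s, l)], aut_delta[(q, (l, a))], m_next)
            succs[nxt] = succs.get(nxt, 0) + prob
            todo.append(nxt)
        trans[state] = succs
    return trans, c1, c2, init


# Hypothetical toy instance: a one-state system that always outputs "a".
sys_delta = {("m0", 0): "m0", ("m0", 1): "m0"}
tau = {"m0": "a"}
aut_delta = {("ok", (0, "a")): "ok", ("ok", (1, "a")): "ok"}
mdp_p = {("n", "a"): {"n": 0.5, "r": 0.5},
         ("r", "a"): {"n": 0.25, "r": 0.75}}
label = {"n": 0, "r": 1}
trans, c1, c2, init = product_chain(
    sys_delta, tau, "m0", aut_delta, "ok", {"ok"}, mdp_p, label, "n")
```

Because the labeling is deterministic, each positive-probability successor of the MDP component determines exactly one product successor, so the probabilities can be copied over directly.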

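Lemma 2. Given (i) a finite-state system $\mathcal{S}$ with alphabets $L$ and $A$, (ii) an automaton $\mathcal{A}$ with alphabet 
$L \times A$ with two cost functions $c_1$ and $c_2$, and (iii) an $L$-labeled MDP $\mathcal{M}$ defining an environment model 
for $\mathcal{S}$, there exists a Markov chain $\mathcal{M}'$ and two cost functions $d_1$ and $d_2$ such that 

\[
\mathit{Value}^{\mathcal{R}_{\mathcal{A}}}_{\mathcal{M}}(\mathcal{S}) \overset{\mathrm{def}}{=} \mathbb{E}_{\mathcal{M}(\mathcal{S})}[\mathcal{R}_{\mathcal{A}} \circ O_{\mathcal{S}}] = \mathbb{E}_{\mathcal{M}'}[\mathcal{R}_{d_1/d_2}]
\]

Proof idea: The construction is the same as the one for Lem. 1 except for the cost functions. The 
cost functions are simply copied from the component referring to the automaton, e.g., given a state 
$(s, q, m) \in (S_{\mathcal{S}} \times Q_{\mathcal{A}} \times S_{\mathcal{M}})$ and an action $a \in A$, $d_1((s, q, m), a) = c_1(q)$ and $d_2((s, q, m), a) = c_2(q)$. 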

In Section 4 we show how to compute an optimal value for MDPs with ratio objectives in polynomial 
time. Since Markov chains with ratio objectives are a special case of MDPs with ratio objectives, we can 
first use Lem. 1 to check whether $\mathcal{S}$ satisfies $\mathcal{A}$ under the environment model. If the check succeeds, we then 
use Lem. 2 to compute the value of $\mathcal{S}$. This algorithm leads to the following theorem. 

Theorem 1 (System value). Given a finite-state system $\mathcal{S}$ with alphabets $L$ and $A$, an automaton $\mathcal{A}$ 
with alphabet $L \times A$ defining a qualitative language, an automaton $\mathcal{B}$ with alphabet $L \times A$ and two cost 
functions $c_1$ and $c_2$ defining a quantitative language, and an $L$-labeled MDP $\mathcal{M}$ defining an environment 
model, we can compute the value of $\mathcal{S}$ with respect to $\mathcal{L}_{\mathcal{A}}$ and $\mathcal{R}_{\mathcal{B}}$ under $\mathcal{M}$ in time polynomial in the 
maximum of $|\mathcal{S}| \cdot |\mathcal{A}| \cdot |\mathcal{M}|$ and $|\mathcal{S}| \cdot |\mathcal{B}| \cdot |\mathcal{M}|$. 

In order to synthesize an optimal system, we construct an MDP from the environment model, the 
quantitative, and qualitative specifications similar to the constructions in Lem. 1 and 2. Any optimal 
strategy for this MDP with a value different from $\infty$ corresponds to a system that satisfies the qualitative 
specification and optimizes the quantitative specification. In the next section, we will show that MDPs 
with ratio objectives have pure memoryless optimal strategies. Therefore, we need to consider only 
strategies that are pure and memoryless. Given a pure and memoryless strategy, we build the 
corresponding system as follows: we reduce the set of enabled actions in each state to the single action 
specified by the strategy. In each state, the enabled action defines the output function of the system. 
Instead of deciding the next state probabilistically, the system moves from one state to the next depending 
on the chosen input value. 

³Note that instead of an MDP with ratio objective, we could have also set up a two-player safety game here. 

In the next section we show how to compute an optimal strategy for a given MDP in time polynomial 
in the number of states. This result together with the construction above leads to the following theorem. 

Theorem 2 (Synthesis). Given an automaton $\mathcal{A}$ with alphabet $L \times A$ defining a qualitative language, 
an automaton $\mathcal{B}$ with alphabet $L \times A$ and two cost functions $c_1$ and $c_2$ defining a quantitative language, 
and an $L$-labeled MDP $\mathcal{M}$ defining an environment model, we can compute an optimal system $\mathcal{S}$ with 
respect to $\mathcal{L}_{\mathcal{A}}$ and $\mathcal{R}_{\mathcal{B}}$ in time polynomial in $|\mathcal{A}| \cdot |\mathcal{B}| \cdot |\mathcal{M}|$. 



4 Calculating the best strategy 

In this section we first outline a proof showing that for every MDP there is a pure and memoryless 
optimal strategy for our payoff function. To this end, we argue that the proof given by [25] can be 
adapted to our case. After that we show how to calculate an optimal pure and memoryless 
strategy. 

4.1 Pure and memoryless strategies suffice 

In [25], Gimbert proved that in an MDP any payoff function mapping to $\mathbb{R}$ that is submixing and prefix 
independent admits optimal pure and memoryless strategies. Since our payoff function may also take 
the value $\infty$, we cannot apply the result immediately. However, since $\mathcal{R}$ maps only to non-negative values 
and the set of measurable functions is closed under addition, multiplication, limit inferior and superior, 
and division (provided that the divisor is not equal to 0), the expected value of $\mathcal{R}$ is always defined and the 
theory presented in [25] also applies in this case. Furthermore, to adapt the proof of [25] to minimizing 
the payoff function instead of maximizing it, one only needs to reverse the inequalities used and replace 
max by min. What remains to show is that $\mathcal{R}$ fulfills the following two properties. 
Lemma 3 ($\mathcal{R}$ is submixing and prefix independent). Let $\mathcal{M} = (S, A, \tilde{A}, p)$ be an MDP and $\rho$ be a run. 

1. For every $i \geq 0$ the prefix of $\rho$ up to $i$ does not matter, i.e., $\mathcal{R}(\rho) = \mathcal{R}(\rho_i \rho_{i+1} \ldots)$. 

2. For every sequence of non-empty words $u_0, v_0, u_1, v_1, \ldots \in (A \times S)^+$ such that $\rho = u_0 v_0 u_1 v_1 \ldots$ we 
have that the payoff of the sequence is greater than or equal to the minimal payoff of the sequences 
$u_0 u_1 \ldots$ and $v_0 v_1 \ldots$, i.e., $\mathcal{R}(\rho) \geq \min\{\mathcal{R}(u_0 u_1 \ldots), \mathcal{R}(v_0 v_1 \ldots)\}$. 

Proof. The first property follows immediately from the first limit in the definition of $\mathcal{R}$. 

For the second property we partition $\mathbb{N}$ into $U$ and $V$ such that $U$ contains the indexes of the parts 
of $\rho$ that belong to a $u_k$ for some $k \in \mathbb{N}$ and such that $V$ contains the other indexes. Formally, we define 
$U := \bigcup_{i \in \mathbb{N}} U_i$ where $U_0 := \{k \in \mathbb{N} \mid 0 \leq k < |u_0|\}$ and $U_i := \{\max(U_{i-1}) + |v_{i-1}| + k \mid 1 \leq k \leq |u_i|\}$. Let 
$V := \mathbb{N} \setminus U$ be the other indexes. 

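Now we look at the payoff from $m$ to $l$ for some $m \leq l \in \mathbb{N}$, i.e., 
$\mathcal{R}_m^l := (\sum_{i=m \ldots l} c_1(\rho_i)) / (1 + \sum_{i=m \ldots l} c_2(\rho_i))$. 
We can divide the sums into two parts, the one belonging to $U$ and the one belonging to $V$, and we get 

$$\mathcal{R}_m^l = \frac{\left(\sum_{i \in \{m \ldots l\} \cap U} c_1(\rho_i)\right) + \left(\sum_{i \in \{m \ldots l\} \cap V} c_1(\rho_i)\right)}{1 + \left(\sum_{i \in \{m \ldots l\} \cap U} c_2(\rho_i)\right) + \left(\sum_{i \in \{m \ldots l\} \cap V} c_2(\rho_i)\right)}.$$

We now define the sub-sums between the parentheses as $u_1 := \sum_{i \in \{m \ldots l\} \cap U} c_1(\rho_i)$, $u_2 := \sum_{i \in \{m \ldots l\} \cap U} c_2(\rho_i)$, 
$v_1 := \sum_{i \in \{m \ldots l\} \cap V} c_1(\rho_i)$ and $v_2 := \sum_{i \in \{m \ldots l\} \cap V} c_2(\rho_i)$. Then we receive 

$$\mathcal{R}_m^l = \frac{u_1 + v_1}{1 + u_2 + v_2}.$$

We will now show 

$$\frac{u_1 + v_1}{1 + u_2 + v_2} \geq \min\left\{\frac{u_1}{u_2 + 1}, \frac{v_1}{v_2 + 1}\right\}.$$

Without loss of generality we can assume $u_1/(u_2 + 1) \geq v_1/(v_2 + 1)$; then we have to show that 

$$\frac{u_1 + v_1}{1 + u_2 + v_2} \geq \frac{v_1}{v_2 + 1}.$$

This holds if and only if $(u_1 + v_1)(1 + v_2) = u_1 + v_1 + u_1 v_2 + v_1 v_2 \geq v_1(1 + u_2 + v_2) = v_1 + v_1 u_2 + v_1 v_2$ 
holds. By subtracting $v_1$ and $v_1 v_2$ from both sides we receive $u_1 + u_1 v_2 = u_1(1 + v_2) \geq u_2 v_1$. If $u_2$ is 
equal to 0 then this holds because $u_1$ and $v_2$ are greater than or equal to 0. Otherwise, this holds if and 
only if $u_1/u_2 \geq v_1/(1 + v_2)$ holds. In general, we have $u_1/u_2 \geq u_1/(u_2 + 1)$. From the assumption we 
have $u_1/(u_2 + 1) \geq v_1/(v_2 + 1)$ and hence $u_1/u_2 \geq v_1/(v_2 + 1)$. The original claim follows because we 
have shown this for any pair of $m$ and $l$. □ 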
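
The central inequality of this proof can also be checked mechanically. The following Python sketch is only an illustration: the function name `submixing_holds` and the tested range of values are assumptions made here, not part of the original argument. It verifies the inequality for all non-negative integer sub-sums up to 7, using exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

def submixing_holds(u1, u2, v1, v2):
    """Check (u1+v1)/(1+u2+v2) >= min(u1/(1+u2), v1/(1+v2)) exactly."""
    combined = Fraction(u1 + v1, 1 + u2 + v2)
    part_u = Fraction(u1, 1 + u2)
    part_v = Fraction(v1, 1 + v2)
    return combined >= min(part_u, part_v)

# Exhaustively test all non-negative integer sub-sums up to 7.
all_hold = all(submixing_holds(*t) for t in product(range(8), repeat=4))
print(all_hold)
```

Such a finite check does not replace the proof, but it is a cheap safeguard against sign errors in the case distinction.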

Theorem 3 (There is always a pure and memoryless optimal strategy). For each MDP with the ratio 
payoff function, there is a pure and memoryless optimal strategy. 



Proof. See [25]. □ 



4.2 Reduction of MDP to a Linear Fractional Program 

In this section, we show how to calculate a pure and memoryless optimal strategy for an MDP with 
ratio objective by reducing the problem to a fractional linear programming problem. A fractional linear 
programming problem is similar to a linear programming problem, but the function that one wants to 
optimize is the fraction of two linear functions. A fractional linear programming problem can be reduced 
to a series of conventional linear programming problems to calculate the optimal value. 

We present the reduction only for unichain MDPs. The extension to general MDPs is based on 
end-components [1] and the fact that end-components have an optimal unichain strategy. 

Our reduction uses the fact that an MDP with a pure and memoryless strategy induces a Markov 
chain and that the runs of a Markov chain have a special property akin to the law of large numbers, 
which we can use to calculate the expected value. 

Definition 4 (Random variables of MCs). Let $p^n(s)$ be the probability of being in state $s$ at step $n$ and 
let $p^*(s) := \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n} p^i(s)$. This is called the Cesàro limit of $p^n$. Let further $v_s^n$ denote the number 
of visits to state $s$ up to time $n$. 

We have the following lemma describing the long-run behavior of unichain Markov chains [31, 29]. 

Lemma 4 (Expected number of visits of a state and well-behaved runs). For every infinite run of 
a unichain Markov chain, the fraction of visits to a specific state $s$ equals $p^*(s)$ almost surely, i.e., 
$P(\lim_{l \to \infty} \frac{v_s^l}{l} = p^*(s)) = 1$. We call the set of runs that have this property well-behaved. 
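
To see why the Cesàro limit is used instead of the plain limit of $p^n$, consider a chain that alternates deterministically between two states: $p^n$ oscillates forever, but its running average settles. The Python sketch below illustrates this; the helper `cesaro_average` and the example chain are assumptions introduced here. It averages $p^0, \ldots, p^n$ and divides by the number of summands, which does not change the limit.

```python
from fractions import Fraction

def cesaro_average(transition, start, steps):
    """Average of the state distributions p^0, ..., p^steps (exact arithmetic)."""
    n = len(transition)
    dist = [Fraction(int(s == start)) for s in range(n)]
    total = [Fraction(0)] * n
    for _ in range(steps + 1):
        total = [t + d for t, d in zip(total, dist)]
        # One step of the chain: dist' = dist * transition.
        dist = [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return [t / (steps + 1) for t in total]

# Two-state chain that alternates deterministically: p^n itself never converges.
flip = [[Fraction(0), Fraction(1)],
        [Fraction(1), Fraction(0)]]
avg = cesaro_average(flip, start=0, steps=999)
print(avg)
```

For this chain the average over an even number of steps is exactly one half for each state, which is also the fraction of visits along its single run.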

When we calculate the expected payoff, we only need to consider well-behaved words as shown in 
the following lemma. 






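Lemma 5. Let $N$ denote the set of runs that are not well-behaved. Then the runs in $N$ do not contribute 
to the expected payoff, i.e., $\mathbb{E}[\mathcal{R}] = \int_{\Omega \setminus N} \mathcal{R} \, dP$, where $\Omega$ denotes the set of all runs. 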

Proof. The probability measure of the set of well-behaved words is 1. Hence the probability measure of 
the complement of this set, i.e., $N$, has to be 0. Sets like these are called null sets. A classical result says 
that null sets do not need to be considered for the Lebesgue integral. □ 

For a well-behaved run, i.e., for every run that we need to consider when calculating the expected 
value, we can calculate the payoff in the following way. 

Lemma 6 (Calculating the payoff of a well-behaved run). Let $\rho$ be a well-behaved run of a unichain 
Markov chain. Denote by $\pi : S \to A$ the only action available at a state. Then 

$$\mathcal{R}(\rho) = \frac{\sum_{s \in S} p^*(s) c_1(s, \pi(s))}{\lim_{l \to \infty} \frac{1}{l} + \sum_{s \in S} p^*(s) c_2(s, \pi(s))}.$$

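Proof. By definition of $\mathcal{R}$ we have 

$$\mathcal{R}(\rho) = \lim_{m \to \infty} \liminf_{l \to \infty} \frac{\sum_{i=m}^{l} c_1(\rho_i)}{1 + \sum_{i=m}^{l} c_2(\rho_i)}.$$

We now assume that the Markov chain consists of one maximal recurrence class. We can do this because 
every non-recurrent state will not influence $\mathcal{R}(\rho)$, because $\rho$ is well-behaved and because $\mathcal{R}$ is prefix 
independent. Hence 

$$\mathcal{R}(\rho) = \liminf_{l \to \infty} \frac{\sum_{i=0}^{l} c_1(\rho_i)}{1 + \sum_{i=0}^{l} c_2(\rho_i)}.$$

We can calculate the sums in a different way: we take the sum over the states and count how often we 
visit one state, i.e., 

$$\frac{\sum_{i=0}^{l} c_1(\rho_i)}{1 + \sum_{i=0}^{l} c_2(\rho_i)} = \frac{\sum_{s \in S} c_1(s, \pi(s)) v_s^l}{1 + \sum_{s \in S} c_2(s, \pi(s)) v_s^l} = \frac{\sum_{s \in S} c_1(s, \pi(s)) (v_s^l / l)}{1/l + \sum_{s \in S} c_2(s, \pi(s)) (v_s^l / l)}.$$

Now we take lim instead of lim inf. We will see later that the sequence converges for $l \to \infty$ and 
hence lim and lim inf have the same value. Because both sides of the fraction are finite values we can 
safely draw the limit into the fraction, i.e., 

$$\lim_{l \to \infty} \left( \frac{\sum_{s \in S} c_1(s, \pi(s)) (v_s^l / l)}{1/l + \sum_{s \in S} c_2(s, \pi(s)) (v_s^l / l)} \right) = \frac{\lim_{l \to \infty} \left( \sum_{s \in S} c_1(s, \pi(s)) (v_s^l / l) \right)}{\lim_{l \to \infty} \left( 1/l + \sum_{s \in S} c_2(s, \pi(s)) (v_s^l / l) \right)} = \frac{\sum_{s \in S} c_1(s, \pi(s)) \lim_{l \to \infty} (v_s^l / l)}{\lim_{l \to \infty} (1/l) + \sum_{s \in S} c_2(s, \pi(s)) \lim_{l \to \infty} (v_s^l / l)}.$$

Finally, by the definition of well-behaved runs we have $\lim_{l \to \infty} \frac{v_s^l}{l} = p^*(s)$. Hence 

$$\frac{\sum_{s \in S} c_1(s, \pi(s)) \lim_{l \to \infty} (v_s^l / l)}{\lim_{l \to \infty} (1/l) + \sum_{s \in S} c_2(s, \pi(s)) \lim_{l \to \infty} (v_s^l / l)} = \frac{\sum_{s \in S} c_1(s, \pi(s)) p^*(s)}{\lim_{l \to \infty} (1/l) + \sum_{s \in S} c_2(s, \pi(s)) p^*(s)}.$$

The limit diverges to $\infty$ if and only if the second costs are all equal to zero and at least one first cost is 
not. In this case the original definition of $\mathcal{R}$ diverges and hence $\mathcal{R}$ and the last expression are the same. 
Otherwise the last expression converges, hence the sequence converges, ergo lim inf and lim of this sequence are the 
same. □ 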
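
The closed form of Lemma 6 can be compared against the ratio of a long run prefix. The Python sketch below does this for a hypothetical deterministic three-state cycle, so that the run is fixed and every state has $p^*(s) = 1/3$; the cost values and the helper names `prefix_ratio` and `limit_ratio` are assumptions chosen for illustration.

```python
from fractions import Fraction

def prefix_ratio(run, c1, c2):
    """Ratio payoff of a finite run prefix: sum(c1) / (1 + sum(c2))."""
    return Fraction(sum(c1[s] for s in run), 1 + sum(c2[s] for s in run))

def limit_ratio(p_star, c1, c2):
    """Closed form of Lemma 6, assuming the weighted second cost is positive."""
    num = sum(p * c for p, c in zip(p_star, c1))
    den = sum(p * c for p, c in zip(p_star, c2))
    return num / den

# Costs of the single action available in each of the three states.
c1 = [2, 0, 1]
c2 = [1, 1, 0]
# The cycle 0 -> 1 -> 2 -> 0 visits every state one third of the time.
run = [i % 3 for i in range(30000)]
p_star = [Fraction(1, 3)] * 3

prefix = prefix_ratio(run, c1, c2)
limit = limit_ratio(p_star, c1, c2)
print(float(prefix), float(limit))
```

The prefix ratio approaches the closed form as the prefix grows; the remaining gap comes from the constant 1 in the denominator, which corresponds to the vanishing term $1/l$.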
Note that the previous lemma implies that the value of a well-behaved run is independent of the 
actual run. In other words, on the set of well-behaved runs of a unichain Markov chain the payoff 
function is constant.^2 Ergo the expected value of such a Markov chain is equal to the payoff of any of its 
well-behaved runs. 

Theorem 4 (Expected payoff of an MDP and a strategy). Let $\mathcal{M}$ be an MDP such that every pure and 
memoryless strategy induces a unichain MC. Let further $p^*$ denote the Cesàro limit of $p^n$ of the induced 
Markov chain. Then for every pure and memoryless strategy $\pi$ 

$$\mathbb{E}_{\mathcal{M}}^{\pi}[\mathcal{R}] = \frac{\sum_{s \in S} c_1(s, \pi(s)) p^*(s)}{\lim_{l \to \infty} (1/l) + \sum_{s \in S} c_2(s, \pi(s)) p^*(s)}.$$

Proof. This follows from the previous lemma and the fact that $\mathcal{R}$ is constant on any well-behaved run. □ 

Note that this means that an expected value is $\infty$ if and only if the second cost of every action in the 
recurrence class of the Markov chain is 0 and there is at least one first cost that is not. 

Using this lemma, we are now able to transform the MDP into a fractional linear program. This is 
done in the same way as is done for the expected average payoff case (cf. [30]). We define variables 
$x(s,a)$ for every state $s \in S$ and every available action $a \in \tilde{A}(s)$. This variable intuitively corresponds 
to the probability of being in state $s$ and choosing action $a$ at any time. Then we have, for example, 
$p^*(s) = \sum_{a \in \tilde{A}(s)} x(s,a)$. We need to restrict this set of variables. First of all, we always have to be in some 
state and choose some action, i.e., the sum over all $x(s,a)$ has to be one. The second set of restrictions 
ensures that we have a stationary distribution, i.e., the sum of the probabilities of going out of (i.e., being 
in) a state is equal to the sum of the probabilities of moving into this state. 

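Definition 5 (Fractional linear program for MDP). Let $\mathcal{M}$ be a unichain MDP such that every Markov 
chain induced by any strategy contains at least one non-zero second cost. Then we define the following 
fractional linear program for it. 

Minimize 

$$\frac{\sum_{s \in S} \sum_{a \in \tilde{A}(s)} x(s,a) c_1(s,a)}{\sum_{s \in S} \sum_{a \in \tilde{A}(s)} x(s,a) c_2(s,a)} \qquad (4)$$

subject to 

$$\sum_{s \in S} \sum_{a \in \tilde{A}(s)} x(s,a) = 1 \qquad (5)$$

$$\sum_{a \in \tilde{A}(s)} x(s,a) = \sum_{s' \in S} \sum_{a \in \tilde{A}(s')} x(s',a) \, p(s',a)(s) \quad \forall s \in S \qquad (6)$$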

There is a correspondence between pure and memoryless strategies and basic feasible solutions to 
the linear program.^3 That is, the linear program always has a solution because every positional strategy 
corresponds to a solution. See [30] for a detailed analysis of this in the expected average reward case. 
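
To make the correspondence concrete, the following Python sketch checks the constraints and evaluates the objective for the two pure memoryless strategies of a small MDP. The MDP, its costs, and the helper names `is_feasible` and `objective` are hypothetical and chosen only for illustration; the non-negativity test is an assumption added here because the variables are meant to be probabilities.

```python
from fractions import Fraction as F

# Hypothetical two-state MDP: p[(s, a)] maps successor states to probabilities.
p = {(0, "a"): {0: F(1, 2), 1: F(1, 2)},
     (0, "b"): {1: F(1)},
     (1, "a"): {0: F(1)}}
c1 = {(0, "a"): 1, (0, "b"): 3, (1, "a"): 1}
c2 = {(0, "a"): 2, (0, "b"): 1, (1, "a"): 0}
states = [0, 1]

def is_feasible(x):
    """Check total mass one and flow balance per state (plus non-negativity)."""
    if sum(x.values()) != 1 or any(v < 0 for v in x.values()):
        return False
    for s in states:
        outflow = sum(v for (t, _), v in x.items() if t == s)
        inflow = sum(v * p[sa].get(s, 0) for sa, v in x.items())
        if outflow != inflow:
            return False
    return True

def objective(x):
    """Ratio objective for the assignment x (assumes a positive denominator)."""
    num = sum(v * c1[sa] for sa, v in x.items())
    den = sum(v * c2[sa] for sa, v in x.items())
    return num / den

# Stationary distributions induced by the two pure memoryless strategies.
x_a = {(0, "a"): F(2, 3), (0, "b"): F(0), (1, "a"): F(1, 3)}
x_b = {(0, "a"): F(0), (0, "b"): F(1, 2), (1, "a"): F(1, 2)}
print(is_feasible(x_a), objective(x_a))
print(is_feasible(x_b), objective(x_b))
```

In this example both assignments are feasible, and choosing action `a` in state 0 gives the smaller ratio (3/4 against 4).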

Once we have calculated a solution of the linear program, we can calculate the strategy as follows. 

Definition 6 (Strategy from solution of linear program). Let $x(s,a)$ be the solution to the linear program. 
Then we define the strategy as follows. 

$$\pi(s) = \begin{cases} \text{arbitrary} & \text{if } x(s,a) = 0 \text{ for every enabled action } a \\ a & \text{if } x(s,a) > 0 \end{cases}$$

Note that this is well defined because for each state $s$ there is at most one action $a$ such that $x(s,a) > 0$, 
because of the bijection (modulo the actions of transient states) between basic feasible solutions and 
strategies and because the optimal strategy is always pure and memoryless. 

^2 Note that the fact that any payoff function that is prefix-independent is constant almost surely on each irreducible Markov 
chain has already been proved by [25]. 

^3 A feasible solution is one that fulfills the linear equations that every solution is subject to. 
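
Reading the strategy off a solution is a direct case distinction. The Python sketch below shows one way to do it; the function name `strategy_from_solution`, the dictionary layout, and the example values are assumptions made for illustration. For a state without positive mass, it picks the first enabled action as the arbitrary choice.

```python
from fractions import Fraction

def strategy_from_solution(x, enabled):
    """Pick, per state, the action carrying positive mass in the solution x."""
    strategy = {}
    for state, actions in enabled.items():
        positive = [a for a in actions if x.get((state, a), 0) > 0]
        # Transient states carry no probability mass; any enabled action works.
        strategy[state] = positive[0] if positive else actions[0]
    return strategy

# Hypothetical solution: state 2 is transient and therefore has no mass.
enabled = {0: ["a", "b"], 1: ["a"], 2: ["c", "d"]}
x = {(0, "a"): Fraction(2, 3), (0, "b"): Fraction(0),
     (1, "a"): Fraction(1, 3),
     (2, "c"): Fraction(0), (2, "d"): Fraction(0)}
print(strategy_from_solution(x, enabled))
```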

4.3 From LFP to LP 

Since solvers for linear fractional programs are not common and there are good free solvers for linear 
programs, we present a method of converting a linear fractional program to a sequence of linear programs 
that calculate the solution. This algorithm is due to [27]. Let $f(x)$ denote the value of Eqn. 4 under 
variable assignment $x$. 

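Input: feasible solution $x_0$, MDP $\mathcal{M}$ 
Output: variable assignment, optimal solution 

$n \leftarrow 0$ 
repeat 
    $g \leftarrow f(x_n)$ 
    $n \leftarrow n + 1$ 
    Solve 
        Minimize $\sum_{s \in S} \sum_{a \in \tilde{A}(s)} x_n(s,a) c_1(s,a) - g \sum_{s \in S} \sum_{a \in \tilde{A}(s)} x_n(s,a) c_2(s,a)$ 
        subject to Eqn. 5 and Eqn. 6 
until $f(x_{n-1}) = f(x_n)$ 
return $x_n$, $f(x_n)$ 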
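
The iteration can be sketched in a few lines of Python. To keep the example self-contained, the LP solver is replaced by a stand-in that enumerates a given list of vertices (here the two basic feasible solutions of a hypothetical two-state MDP); this substitution, the cost values, and all function names are assumptions for illustration and not the implementation described in this paper. A real implementation would call an LP solver at that point.

```python
from fractions import Fraction as F

# Hypothetical costs of the state-action pairs of a small MDP.
c1 = {(0, "a"): 1, (0, "b"): 3, (1, "a"): 1}
c2 = {(0, "a"): 2, (0, "b"): 1, (1, "a"): 0}

def linear(x, cost):
    """Linear function: sum of x(s, a) * cost(s, a)."""
    return sum(v * cost[sa] for sa, v in x.items())

def f(x):
    """Ratio objective (assumes a positive denominator)."""
    return linear(x, c1) / linear(x, c2)

def solve_lp(vertices, g):
    """Stand-in for an LP solver: a linear objective attains its minimum at a vertex."""
    return min(vertices, key=lambda x: linear(x, c1) - g * linear(x, c2))

def minimize_ratio(x0, vertices):
    """Repeat the linearised problem until the ratio no longer improves."""
    x = x0
    while True:
        g = f(x)
        x_next = solve_lp(vertices, g)
        if f(x_next) == g:
            return x_next, g
        x = x_next

x_a = {(0, "a"): F(2, 3), (0, "b"): F(0), (1, "a"): F(1, 3)}
x_b = {(0, "a"): F(0), (0, "b"): F(1, 2), (1, "a"): F(1, 2)}
best, value = minimize_ratio(x_b, [x_a, x_b])
print(best, value)
```

Starting from the worse assignment `x_b`, the first round moves to `x_a`, and the second round confirms that the ratio no longer changes, so the loop stops.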

4.4 Preliminary Implementation 

We have developed a tool that can handle (finite) unichain MDPs with ratio objectives based on the 
approach presented in this paper. Our tool is implemented in Haskell and uses the GNU Linear 
Programming Kit to solve the resulting linear programs. 

We made some initial experiments using the server-client example from Section 3. In the case of 
two clients we have an MDP with 24 states and 288 edges. Building and solving this system takes less 
than 100 milliseconds on a laptop with an Intel Core 2 Duo P8600 clocked at 2.40 GHz. The resulting 
machine behaves as follows: If it receives only one request at the start, then it acknowledges this request 
immediately. Whenever Client 2, i.e., the complicated client, sends a request, then it also receives the 
acknowledgment, with one exception: When Client 1 has an outstanding request, i.e., if its qualitative 
specification is in state s^, and if Client 2 has no outstanding request, then Client 1 receives the acknowl- 
edgment. The expected value is roughly 1.2 = 12/10. This means that, out of 12 requests, 10 can be 
served, which means 83.3%. 

5 Conclusions and Future Work 

We have presented a technique to automatically synthesize systems that satisfy a qualitative specification 
and optimize a quantitative specification under a given environment model. Our technique can handle 
qualitative specifications given by an automaton with a set of safe states, and quantitative specifications 
defined by an automaton with ratio objective. 
Currently, we are working on a better representation of the input specifications. In particular, we 
are aiming for a symbolic representation that would allow us to use a combined symbolic and explicit 
approach, which has been shown to be very effective for MDPs with long-run average objective [32]. 
Furthermore, we are extending the presented approach to qualitative specifications described by arbitrary 
ω-regular specifications. 